In the last decade, the Internet has exploded into our lives. What started out as a way to link 1,000 universities and private labs has morphed into a defining aspect of American culture, one that dominates our day-to-day lives. We use it for everything from reading the news to browsing social media to playing mobile games. As the Internet creeps further into our lives, people have begun to question what we as Americans really get out of using it. Those who see heavy Internet use as a positive say that, among other things, it is a vehicle for consuming the world around you, keeping you current with the times. Opponents of heavy Internet usage counter that the net is most frequently used to post memes to Facebook, play mobile games, and watch movie trailers, none of which is particularly valuable. And with this being an election year, the question has political implications.
As we enter the meat of primary season, election coverage is everywhere on online news outlets and social media platforms. Candidates regularly update their Instagram, Facebook, and Twitter pages, keeping voters up to date on the latest from the campaign trail. With all this election coverage on the Internet, it is natural to ask: is higher Internet usage associated with higher voting rates? After analyzing the data, the answer is yes. Higher county Internet usage is correlated with higher county voter participation, although the effect is small.
Before we dive into the research, one thing must be clarified: all of my data analysis was done at the county level. While it may be tempting to extend these relationships to the individual level, there is no data here to back that up. For example, while the data points to a relationship between increased county Internet usage and higher county voter participation, it says nothing about whether an increase in an individual’s Internet usage is associated with that person being more likely to vote.

For this thesis, I looked at the 2012 general election, since it was the most recent election. I used voting data from USElectionAtlas.org, which collects data from all elections in the US. I used county-level data indicating how many people in each county voted in the election, in addition to which candidate they voted for (limited to either Barack Obama or Mitt Romney). I defined county voter participation as the number of people who voted divided by the population of that county. The mean of county-level participation is 44%, and the standard deviation is 8%.

To measure Internet usage, I downloaded 2012 data released by the Federal Communications Commission (FCC), which measured county-level Internet activity in a variety of ways (the data can be found at fcc.gov). In all, I used six different measurements released by the FCC to gauge county Internet usage. The first two variables measure how much data was being demanded by users per county, in kilobits (a measure of bandwidth) per second, normalized for population. The first, Demand General, measures the number of residential fixed connections using a moderate amount of bandwidth. The other, Demand Fast, measures the number of residential fixed connections using a large amount of bandwidth. The distinction is that Demand Fast represents people using the Internet heavily, as opposed to Demand General, which represents people using the Internet moderately.
Similarly, Providers Slow, Providers Medium, and Providers Fast measure the number of providers supplying small, medium, and large amounts of bandwidth in a county, respectively. The data also includes a measure of the number of mobile Internet providers in a county. Lastly, I normalized the provider measurements by county population, to account for the fact that larger counties naturally have more providers than smaller ones.
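The participation and normalization arithmetic described above is simple enough to sketch. Below is an illustrative Python snippet with made-up numbers (the analysis itself was done in R, and these figures are not from any real county):

```python
# Made-up county figures, purely for illustration.
two_party_votes = 40_000   # votes cast for Obama or Romney
population = 100_000       # county population
providers_fast = 5         # providers supplying a large amount of bandwidth

# County voter participation: two-party votes over county population.
participation = two_party_votes / population          # 0.4

# Provider counts normalized per 10,000 residents, so that large and
# small counties are comparable.
providers_per_10k = providers_fast * 10_000 / population   # 0.5

print(participation, providers_per_10k)
```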
In addition to the Internet measurements, I gathered a set of confounding variables that I thought would affect county voter participation regardless of Internet usage. The three I chose were median income, median age, and population. The purpose of including these variables is to better isolate the effect Internet usage has on voter participation. For example, by including median income, I can compare the Internet and voting rates of counties whose residents are at approximately the same wealth level. Otherwise, we might chalk up differences in voting rates to Internet usage when they were really due to differences in wealth. All three confounding variables were collected from the American Community Survey (data can be found at census.gov/programs-surveys/acs). After gathering all of my data, I merged everything into a single table using the FIPS code provided with each dataset (a FIPS code is a unique numeric identifier for each county in the U.S.).
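The FIPS-keyed merge step can be sketched in Python with pandas (an illustrative re-creation, not the original code; the tiny tables and column names are hypothetical stand-ins for the voting, FCC, and ACS data):

```python
import pandas as pd

# Hypothetical mini-tables standing in for the real voting, FCC, and ACS
# datasets; each carries the county FIPS code as a shared key.
votes = pd.DataFrame({"fips": [1001, 1003], "countyPar": [0.43, 0.47]})
fcc = pd.DataFrame({"fips": [1001, 1003], "demandGen": [2, 4]})
acs = pd.DataFrame({"fips": [1001, 1003], "medIncome1000": [41.2, 50.3]})

# Inner joins on FIPS keep only counties present in every source.
master = votes.merge(fcc, on="fips").merge(acs, on="fips")
print(master.shape)  # (2, 4)
```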
After collecting and cleaning the data, it was time to analyze it. The first thing I did was check what correlation, if any, existed between the Internet variables and voter participation. Below are the six Internet variables plotted against voter participation.
There appears to be some positive correlation for a few of them, but overall the picture is not very clear.
For a quantitative measurement, I fit six simple linear models with county participation as the response and each Internet measurement in turn as the predictor. The only significant variables were the two demand variables, both positively correlated with participation. This hints at a positive relationship, but to investigate further I had to fit a multiple linear regression model. To decide which variables to include, I used forward selection: I kept adding predictor variables to the model until an ANOVA test determined that the extra variable did not significantly reduce the residual variance of the fitted model.
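The forward-selection loop can be sketched in Python (an illustrative re-implementation, not the original R code; the 3.85 cutoff approximates the 5% critical value of the F-distribution with 1 numerator degree of freedom and large n):

```python
import numpy as np

def rss(X, y):
    # Residual sum of squares of an OLS fit (intercept column added).
    n = len(y)
    Xd = np.column_stack([np.ones(n), X]) if X.size else np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return float(resid @ resid)

def forward_select(candidates, y, f_crit=3.85):
    """Greedy forward selection: add the candidate with the largest partial
    F-statistic against the current model; stop when none exceeds f_crit."""
    chosen, remaining, n = [], list(candidates), len(y)
    while remaining:
        base_X = (np.column_stack([candidates[c] for c in chosen])
                  if chosen else np.empty((n, 0)))
        base = rss(base_X, y)
        best, best_f = None, f_crit
        for name in remaining:
            cols = [candidates[c] for c in chosen] + [candidates[name]]
            full = rss(np.column_stack(cols), y)
            p_full = len(cols) + 1                      # +1 for the intercept
            f = (base - full) / (full / (n - p_full))   # partial F, 1 df
            if f > best_f:
                best, best_f = name, f
        if best is None:
            break
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Synthetic demonstration: x1 drives y, x2 is pure noise.
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=200), rng.normal(size=200)
y = 2.0 * x1 + rng.normal(size=200)
print(forward_select({"x1": x1, "x2": x2}, y))
```

The R summary of the model that this procedure ultimately selected is shown below.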
##
## Call:
## lm(formula = countyPar ~ demandGen + medIncome1000 + medAge +
##     providersFast + providersSlow, data = master)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.22531 -0.03764 -0.00025  0.03646  0.32718
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    0.0445901  0.0110581   4.032 5.65e-05 ***
## demandGen      0.0087883  0.0016438   5.346 9.63e-08 ***
## medIncome1000  0.0021930  0.0001121  19.561  < 2e-16 ***
## medAge         0.0084563  0.0002251  37.559  < 2e-16 ***
## providersFast  0.0022265  0.0001327  16.774  < 2e-16 ***
## providersSlow -0.0024612  0.0001244 -19.790  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0585 on 3106 degrees of freedom
## Multiple R-squared: 0.4804, Adjusted R-squared: 0.4795
## F-statistic: 574.3 on 5 and 3106 DF, p-value: < 2.2e-16
The first thing to note when looking at this model is that all of the coefficients are significant, as indicated by the extremely low p-values. Both confounders are also positive, as I predicted. The next thing to examine is the coefficients themselves. demandGen has a value of 0.0088, meaning that a one-unit increase in demandGen for a county is associated, on average, with a 0.88 percentage-point increase in voter participation. That number is small, which might suggest this predictor is practically insignificant. However, demandGen ranges from one to five, so a county with a demandGen of five is expected to have a participation rate about 3.5 percentage points higher than a county with a demandGen of one. That is almost half a standard deviation of the county voter participation data, so we can conclude that higher Internet demand is associated with meaningfully higher county voter participation.

At first glance, the negative coefficient on providersSlow is not a good sign if we are trying to show that Internet usage is positively correlated with voting rates. However, it is important to note the difference between providersFast and providersSlow. The gap between them reflects the difference between counties that have any type of Internet provider and counties whose providers supply a large amount of bandwidth. The fact that providersSlow is negative while providersFast is positive can therefore be interpreted as follows: measuring just the raw number of Internet providers in a county would lead us to believe there is a negative correlation between the supply of Internet and voter participation, but measuring the number of providers supplying a large amount of bandwidth shows that increased high-bandwidth Internet supply is correlated with increased voter participation. Below are maps of all the variables in the model, plotted geographically on a US map.
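Before turning to the maps, the demandGen effect-size arithmetic can be checked directly (illustrative Python, using the coefficient from the model output above):

```python
# Coefficient for demandGen from the fitted model summary.
coef_demand_gen = 0.0087883

# demandGen ranges from 1 to 5, so the largest contrast is 4 units.
effect = coef_demand_gen * (5 - 1)
print(round(effect, 3))  # 0.035 -> about 3.5 percentage points of participation

# Relative to the 8-point standard deviation of county participation:
print(round(effect / 0.08, 2))  # roughly 0.44 standard deviations
```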
One thing to note in the residuals map is that the model does poorly in certain geographies: mainly Louisiana and Alabama, the upper Northwest, and parts of the Northeast. This tells us there are characteristics unique to these regions influencing voting rates that the model does not account for. Other than that, it is impressive how similar the predicted voting rate map looks to the actual voting rate map.
Overall, the positive correlation of Internet demand, the mixed results from Internet supply, and the accuracy of the maps are all consistent with the thesis that increased county Internet usage is correlated with higher county voting rates.
But can we trust these results? To support the thesis, we must show that these numbers did not arise by chance, which means assessing the accuracy of the model. In doing so, we can more persuasively argue that higher Internet usage is associated with higher county voting participation. The next section is dedicated to exactly that. The first way I assessed model accuracy was through diagnostic plots.
Here we can see the four standard diagnostic plots for my linear regression model. In the residuals-vs-fitted plot, the residuals are a little worse at the extremes but reasonable in the dense regions, and there are no non-linear patterns in the residuals. The next graph is the normal Q-Q plot, a good way to check whether the error terms are normally distributed; for the most part, they appear to be. The scale-location plot checks whether the error terms are homoscedastic, and it looks like they are not: the residuals grow at the extremes and are smallest where the data is most clustered. The last plot, residuals vs leverage, shows that no single highly influential data point is tugging on the regression line; the Cook’s distances are very small, and the largest-leverage point sits very close to zero standardized residual. On balance, these plots suggest the error terms are fairly normally distributed, though somewhat heteroscedastic.
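The quantities behind those four plots can also be computed directly. Below is a minimal numpy sketch (illustrative Python on synthetic data, not the original R code):

```python
import numpy as np

def ols_diagnostics(X, y):
    """Quantities behind R's four lm() diagnostic plots: fitted values,
    residuals, leverages (hat-matrix diagonal), and Cook's distances."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])   # add intercept column
    p = Xd.shape[1]
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    fitted = Xd @ beta
    resid = y - fitted
    # Leverage: diagonal of the hat matrix H = X (X'X)^-1 X'.
    lev = np.diag(Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T)
    s2 = resid @ resid / (n - p)
    std_resid = resid / np.sqrt(s2 * (1 - lev))
    cooks = std_resid**2 * lev / ((1 - lev) * p)
    return fitted, resid, lev, cooks

# Synthetic data, purely for demonstration.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(size=100)
fitted, resid, lev, cooks = ols_diagnostics(X, y)
print(lev.sum())   # leverages sum to the number of parameters (here 3)
```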
The next way I evaluated the model was by looking at simple model statistics. The model has a residual standard error of 0.0585, meaning that on average its fitted participation rates are off by about 5.85 percentage points. An adjusted R-squared of 0.4795 tells us the model is only moderately good at explaining the variation in the data.
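For reference, both of these statistics follow directly from the residuals. A small Python sketch of the formulas (illustrative, with synthetic data; p counts the estimated coefficients, intercept included):

```python
import numpy as np

def rse_and_adj_r2(y, fitted, p):
    """Residual standard error and adjusted R-squared for an OLS fit
    with p estimated coefficients (intercept included)."""
    n = len(y)
    rss = np.sum((y - fitted) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    rse = np.sqrt(rss / (n - p))
    adj_r2 = 1 - (rss / (n - p)) / (tss / (n - 1))
    return rse, adj_r2

# Synthetic demonstration: residuals with scale roughly 0.1.
rng = np.random.default_rng(5)
fitted = rng.normal(size=50)
y = fitted + rng.normal(scale=0.1, size=50)
rse, adj = rse_and_adj_r2(y, fitted, p=2)
print(rse, adj)
```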
The next method I used to assess model accuracy was cross-validation. I split the data into two sets: a training set containing 90% of the data, which I used to fit the model from the previous section, and a test set containing the remaining 10%, which I used to evaluate the model’s predictions. On the test set, the model had a standard prediction error of 6.37%. Like a root-mean-squared error, this can be interpreted as saying that predictions from the model fitted on the training data will be off by 6.37 percentage points on average. As another test of accuracy, I attached a binary variable to each county called National Average, equal to one if the county’s voter participation exceeded 41% (the percent of people who voted nationally in the 2012 election) and zero otherwise. I then fit a logistic model on the training data using the same predictor variables as my linear model; the logistic model outputs each county’s probability of at least matching the national average voting rate. The model correctly predicted whether a county met the national average for 207 out of 285 test counties, or about 73%. So while the model may not be precise in predicting a county’s exact voting rate, it can for the most part tell whether that rate will be above or below average. On the whole, cross-validation showed the model to be mildly accurate.
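The cross-validation procedure can be sketched as follows (illustrative Python on synthetic data, not the original R code; for brevity, the above/below-average classification here thresholds the linear prediction rather than fitting a separate logistic model):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for the county table: 300 rows, 3 predictors,
# participation rates centered near 44%.
X = rng.normal(size=(300, 3))
y = 0.44 + X @ np.array([0.02, 0.01, -0.015]) + rng.normal(scale=0.06, size=300)

# 90/10 train/test split.
idx = rng.permutation(len(y))
test, train = idx[:30], idx[30:]

Xd = np.column_stack([np.ones(len(y)), X])
beta, *_ = np.linalg.lstsq(Xd[train], y[train], rcond=None)
pred = Xd[test] @ beta

# Held-out prediction error (root mean squared error on the test set).
rmse = np.sqrt(np.mean((pred - y[test]) ** 2))

# Above/below the 41% national average, by thresholding the prediction --
# a simplification of the separate logistic fit described in the text.
accuracy = np.mean((pred > 0.41) == (y[test] > 0.41))
print(round(rmse, 3), accuracy)
```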
The last thing I did to test model accuracy was to bootstrap the coefficients of my predictor variables: repeatedly resample my data set with replacement, refit the model, and examine the resulting distribution of each coefficient. The bootstrap distribution of a coefficient shows how “reliable” that coefficient is. A wide or erratic distribution would tell us that we cannot trust the coefficient returned by the model, since a different data set would likely return a much different number. Below are the bootstrap distributions of all of the predictor variables.
As can be seen, none of the bootstrapped distributions has a large variance. Each distribution is accompanied by a Q-Q plot verifying that it is approximately normal. As a quantitative measure, every predictor variable (besides the intercept) has a bootstrap standard error below 0.01. Thus, bootstrapping supported the accuracy of the model.
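The bootstrap itself is mechanically simple. A minimal Python sketch on synthetic data (illustrative, not the original R code):

```python
import numpy as np

def bootstrap_coefs(X, y, n_boot=500, seed=3):
    """Resample (X, y) rows with replacement, refit OLS each time, and
    return the matrix of coefficient draws (one row per resample)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])   # add intercept column
    draws = np.empty((n_boot, Xd.shape[1]))
    for b in range(n_boot):
        i = rng.integers(0, n, size=n)      # sample rows with replacement
        beta, *_ = np.linalg.lstsq(Xd[i], y[i], rcond=None)
        draws[b] = beta
    return draws

# Synthetic demonstration with known small coefficients.
rng = np.random.default_rng(4)
X = rng.normal(size=(400, 2))
y = 0.4 + X @ np.array([0.03, -0.01]) + rng.normal(scale=0.05, size=400)
draws = bootstrap_coefs(X, y)
print(draws.std(axis=0))   # bootstrap standard error of each coefficient
```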
Taking into account the model plots, model statistics, cross-validation, and bootstrapping, we can conclude that these methods did more to show the model is accurate than inaccurate. So how does model accuracy relate back to our discussion about Internet usage and voting? It tells us that we can trust the result: higher Internet usage is correlated with higher county voting participation.
And now for the million-dollar question: is higher Internet usage causing the higher voter participation? The answer is that I don’t know. Given the data I have and the analysis in this paper, the only thing I can say with some degree of confidence is that higher county Internet usage is correlated with higher county voting participation, albeit marginally. This was shown in the model and confirmed through the model-accuracy checks, but none of it says anything about causation. One could make the following case: Americans who frequently use the Internet are more aware of current events. During an election year, they are more exposed to election coverage and news via online sources, and election coverage can also dominate their social media circles. Human nature being what it is, this makes people want to fit in with what everyone else is doing, and thus more likely to vote. While this story may be true, there are simply too many unaccounted-for confounding variables to say so with any certainty. There may be things about the people who live in these counties that would cause them to vote more than the average American anyway. For example, they are probably more likely to live in cities, and perhaps people who live in cities simply vote more than people who live in rural communities. (I don’t know whether that particular claim is true; it is simply meant to illustrate that, given the available data, isolating the true effect of Internet usage on voting rates is an almost impossible task.) Beyond population density, here are some other confounders I did not include in my model: education, race, gender, geography, the candidates themselves, family status, cultural differences, and employment. This is obviously not a comprehensive list, and there are many more that I cannot even think of.
In addition to excluded confounders, there were other limitations to this project. To calculate voter participation for a county, I added the number of people who voted for Barack Obama or Mitt Romney and divided by the county population. There are two problems with this method. First, instead of dividing by total population, I should have divided by the population eligible to vote; counties with an unusual number of ineligible residents are unfairly penalized in my model. Second, I did not count people who voted for candidates other than Barack Obama or Mitt Romney, although this error is minuscule considering the two accounted for over 98% of the votes. Another limitation was the data itself. I was limited to what I could readily find available online, and the data I did collect contains error: everything from the ACS is an estimate, not an exact count. The biggest limitation is that my data comes from 2012. In the last four years, the Internet has pushed even further into our lives, so anyone interested in this subject is encouraged to investigate more recent data to see how things have changed.